Monitoring a replica set

To learn which instances belong to the replica set and to obtain statistics for each of them, issue a box.info.replication request:

tarantool> box.info.replication
---
  replication:
    1:
      id: 1
      uuid: b8a7db60-745f-41b3-bf68-5fcce7a1e019
      lsn: 88
    2:
      id: 2
      uuid: cd3c7da2-a638-4c5d-ae63-e7767c3a6896
      lsn: 31
      upstream:
        status: follow
        idle: 43.187747001648
        peer: replicator@192.168.0.102:3301
        lag: 0
      downstream:
        vclock: {1: 31}
    3:
      id: 3
      uuid: e38ef895-5804-43b9-81ac-9f2cd872b9c4
      lsn: 54
      upstream:
        status: follow
        idle: 43.187621831894
        peer: replicator@192.168.0.103:3301
        lag: 2
      downstream:
        vclock: {1: 54}
...

This report is for a master-master replica set of three instances, each having its own instance id, UUID and log sequence number.

The request was issued at master #1, and the reply includes statistics for the other two masters, reported relative to master #1.
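As a rough illustration of how such a report can be read programmatically, the sketch below walks over the entries and extracts the id, LSN, and upstream status of each instance. It assumes the report has already been loaded into a Python dict with the same shape as the output above; the REPORT variable and the summarize function are hypothetical names used only for this example. In the sample, the entry for instance 1 (where the request was issued) has no upstream section, so its status is returned as None.

```python
# Hypothetical data: the sample report above, transcribed into a Python dict.
REPORT = {
    1: {"id": 1, "uuid": "b8a7db60-745f-41b3-bf68-5fcce7a1e019", "lsn": 88},
    2: {
        "id": 2,
        "uuid": "cd3c7da2-a638-4c5d-ae63-e7767c3a6896",
        "lsn": 31,
        "upstream": {
            "status": "follow",
            "idle": 43.187747001648,
            "peer": "replicator@192.168.0.102:3301",
            "lag": 0,
        },
        "downstream": {"vclock": {1: 31}},
    },
    3: {
        "id": 3,
        "uuid": "e38ef895-5804-43b9-81ac-9f2cd872b9c4",
        "lsn": 54,
        "upstream": {
            "status": "follow",
            "idle": 43.187621831894,
            "peer": "replicator@192.168.0.103:3301",
            "lag": 2,
        },
        "downstream": {"vclock": {1: 54}},
    },
}


def summarize(report):
    """Return one (id, lsn, upstream status) tuple per instance, ordered by id.

    Entries without an upstream section get a status of None.
    """
    rows = []
    for instance_id in sorted(report):
        entry = report[instance_id]
        upstream = entry.get("upstream")
        status = upstream["status"] if upstream else None
        rows.append((entry["id"], entry["lsn"], status))
    return rows


for row in summarize(REPORT):
    print(row)
```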

The primary indicators of replication health are:

idle: the time (in seconds) since the instance received the last event from a master.

If the master has no updates to send to the replicas, it sends heartbeat messages every replication_timeout seconds. The master is programmed to disconnect if it does not see acknowledgments of the heartbeat messages within replication_timeout * 4 seconds.

Therefore, in a healthy replication setup, idle should never exceed replication_timeout: if it does, either the replication is lagging seriously behind, because the master is running ahead of the replica, or the network link between the instances is down.

lag: the time difference between the local time at the instance, recorded when the event was received, and the local time at another master, recorded when the event was written to the write-ahead log on that master.

Since the lag calculation uses the operating system clocks from two different machines, do not be surprised if it’s negative: a time drift may lead to the remote master clock being consistently behind the local instance’s clock.

For multi-master configurations, lag is the maximal lag.
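The two indicators described above can be turned into a simple check. The following is a minimal sketch, not an official monitoring tool: it assumes one upstream section of the report is available as a Python dict, and that replication_timeout is 1 second (an assumed value for illustration; use whatever your instances are configured with). The function names check_upstream and disconnect_deadline are hypothetical.

```python
def check_upstream(upstream, replication_timeout):
    """Return a list of human-readable findings for one upstream section."""
    findings = []
    # In a healthy setup, idle should not exceed replication_timeout,
    # because the master sends heartbeats at that interval.
    if upstream["idle"] > replication_timeout:
        findings.append("idle exceeds replication_timeout")
    # lag is computed from the clocks of two machines, so a negative
    # value points at clock drift rather than at a replication fault.
    if upstream["lag"] < 0:
        findings.append("negative lag (possible clock drift)")
    return findings


def disconnect_deadline(replication_timeout):
    """Seconds without heartbeat acknowledgments before the master disconnects."""
    return 4 * replication_timeout


REPLICATION_TIMEOUT = 1.0  # assumed value, in seconds

# Upstream section of instance 2 from the sample report.
sample = {
    "status": "follow",
    "idle": 43.187747001648,
    "peer": "replicator@192.168.0.102:3301",
    "lag": 0,
}

print(check_upstream(sample, REPLICATION_TIMEOUT))
print(disconnect_deadline(REPLICATION_TIMEOUT))
```

A negative lag is reported separately from an excessive idle value because, as noted above, it usually reflects clock drift rather than a replication problem.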

For better understanding, see the following diagram illustrating the upstream and downstream connections within the replica set of three instances:
